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Maximum entropy models provide the least constrained probability distributions that reproduce 
statistical properties of experimental datasets. In this work we characterize the learning dynamics 
that maximizes the log-likelihood in the case of large but finite datasets. We first show how the 
steepest descent dynamics is not optimal as it is slowed down by the inhomogeneous curvature of 
the model parameters space. We then provide a way for rectifying this space which relies only on 
dataset properties and does not require large computational efforts. We conclude by solving the 
long-time limit of the parameters dynamics including the randomness generated by the systematic 
use of Gibbs sampling. In this stochastic framework, rather than converging to a fixed point, the 
dynamics reaches a stationary distribution, which for the rectified dynamics reproduces the posterior 
distribution of the parameters. 

We sum up all these insights in a “rectified” Data-Driven algorithm that is fast and by sampling 
from the parameters posterior avoids both under- and over-fitting along all the directions of the 
parameters space. Through the learning of pairwise Ising models from the recording of a large 
population of retina neurons, we show how our algorithm outperforms the steepest descent method. 


Nowadays scientists from many different disciplines 
face the problem of understanding and characterizing 
the behavior of large multi-units complex systems with 
strong correlations [1-8]. Statistical Inference tackles 
these problems by inferring parameters of a chosen, con¬ 
text inspired, probability distributions to obtain mod¬ 
els reproducing the system behavior. The basic strategy 
consists in choosing a model family described by a set 
of parameters and tune them to reproduce the dataset 
properties. 

However, in many cases, due to a substantial unaware¬ 
ness of the system properties or to avoid biasing the 
results, hypotheses on the distribution functional form 
cannot be suggested nor trusted. To overcome this issue, 
the Maximum Entropy (MaxEnt) Principle [9] suggests to 
search for the probability distribution with the largest en¬ 
tropy between those satisfying a set of constraints, which 
force the model distribution to reproduce the experimen¬ 
tal averages of a list of observables. Eor systems of bi¬ 
nary variables a common and fruitful choice is to con¬ 
straint the model distribution to reproduce the experi¬ 
mental single and pairwise correlations. This choice has 
been successfully applied in system neuroscience [3, 10- 
13], gene regulation [14], fitness estimation [15, 16] and 
many others fields of science. Moreover other possibilities 
with different observable lists have been also investigated 
[17, 18], suggesting that depending on the context a care¬ 
ful choice of the observables can improve the inference 
accuracy and predicting power. 

However, once the MaxEnt problem is posed as an in¬ 
ference task, finding its solution can be very hard. Eor 
large system size, in fact, the inference problem cannot 
be solved analytically and specifically devoted algorithms 
are required. The most known and widely used algorithm 
[19] was introduced in the eighties and later developed 
[20, 21]. Other approaches include Selective Cluster Ex¬ 
pansions [22, 23], Minimum Probability Plow [24] and 


several approximation schemes [25-29]. 

In this paper we develop further the approach of [19]. 
Our results are based on an analysis of the geometri¬ 
cal structure of the model parameters space. Eor the 
MaxEnt models inference this space is shown to bene¬ 
fit of peculiar properties, which allow us to introduce a 
novel quasi-Newton method, the Data-Driven algorithm, 
and to completely characterize its long-time learning dy¬ 
namics. As it is affected by the randomness of Mont a 
Carlo estimates, the parameters dynamics is stochas¬ 
tic and eventually converges to a stationary distribu¬ 
tion around the log-likelihood maximum. The presented 
method takes advantage of this randomness and shapes 
the stationary distribution to reproduce the Bayesian 
posterior distribution and thus to sample from it. This 
last feature allows to avoid overfitting and endorses our 
algorithm to be well suited for dataset with highly in¬ 
homogeneous noise. We conclude with a test on biologi¬ 
cal data. 


I. MAXIMUM ENTROPY MODELS 

Systems of interest for MaxEnt approach are composed 
by N units that show a stochastic and coupled behav¬ 
ior. In order to be concrete, in this work we focus on 
binary, O/I, units, but most of the results can be di¬ 
rectly generalized to multiple-state variables, as Potts or 
Poisson models, or even continuous ones. In this frame¬ 
work, datasets are composed by B independent measure¬ 
ments of the synchronous state of the N system units: 

As an example, for a binned spike trains 
recording of several neurons, cri{b) = I may represent the 
activity (spike) of the i-th neuron in the b-th time-bin 
and cri{b) = 0 its silence. 

The first, crucial, step in the MaxEnt analysis is the 
choice of a set of observables. Observables are generic 
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functions of the system units that should be chosen 
in order to catch the way system components interact. 
Strictly speaking, if some statistical feature is relevant 
for the system behavior, the corresponding observable 
should be included in order to force the model to re¬ 
produce that feature. On the other way round, if some 
feature is not essential, by excluding the corresponding 
observable, the model will adapt its behavior consistently 
with the other imposed features. Technically speaking 
the observables will be the sufficient statistics of the 
model probability distribution. As an example, the most 
common choice in the literature considers all the single 
variable terms (cr^) and pairwise products (cr^cTj) and it 
leads to the construction of the well know Disorder Pair¬ 
wise Ising model [3]. This particular choice allows to take 
into account the pairwise correlations between the units 
and allows the model to adapt higher order statistics con¬ 
sistently. 

Moreover, a carefully choice of the observable should 
take into account the quality of the dataset. Noisy ob¬ 
servables for which the average is not significantly es¬ 
timated will induce data overfitting with the risk of 
strongly reducing the model prediction power. From this 
point of view and overloading the jargon of Bayesian In¬ 
ference, the observables choice represents a sort of prior 
term (see sect. IB for details). 

For the sake of generality, we like to present the re¬ 
sults for an arbitrary observables choice and with this 
aim in mind we introduce a generic vector of observ¬ 
ables X)(cr) = {'Ea{cr)}a=ii functions of the system units. 
The choice of the observables vector completely deter¬ 
mines the functional form of the MaxEnt model family 
[9, 22]. In fact, by searching for the probability distri¬ 
bution which has the largest entropy among those that 
reproduce the observables averages, we obtain (see, for 
example [23], for the whole functional calculation): 

Px(a)=exp{I](a).X}/Z[X] (1) 

where Z[K] is a normalization constant, X is a D di¬ 
mensional fields vector, namely the model parameters, 
conjugated to the observable vector X and 5](cr) • X = 
Ea(cr)Xa is the scalar product in Euclidean space[30]. 

As the distribution (1) measures the probability of 
each possible system configurations, we can compute its 
dataset (log-)likelihood: 

B 

/( data I X) ^^lnPxq(6)). (2) 

6=1 


Even if later we will consider posterior sampling, see sect. 
IB, for the moment we restrict the inference task to the 
search for the set of fields X* that maximizes the log- 
likelihood: 


X* 


arg max 



(3) 


l[X] =b( X-P-lnZ[X] ), (4) 


and 

^ 6=1 

are the experimental averages of the observables. In fact, 
finding the fields values X* solving 

v„nx*] = B(^-ga[X*]) =0 , (6) 

where Va = djdXa^ is equivalent to enforce the con¬ 
straints: 

Q[X*]=P (7) 

where 

Q[X] = {S)x = Tv^ [S((t)Px(<t)] . (8) 

are the model averages of the observables. Here and in 
the following (• • • )x means average over the model distri¬ 
bution (1) with fields X and (...) always refers to dataset 
averages. Tr^r means summation over all possible system 
configurations. 

Eor most of the reasonable observables choice and in 
case of good data quality, the solution of the maximiza¬ 
tion problem (3) exists and is unique. However for sake 
of completeness, in app. A we discuss when and how 
multiple solutions could arise. 


A. The geometry of the X- space 

In the following sections we will deal with the non triv¬ 
ial geometry embedding the fields space, the X-space. 
This geometry is described by three matrices: the (nega¬ 
tive) log-likelihood Hessian P[X], the Eisher matrix /[X] 
and the model susceptibility matrix x[X] 

Hal>[X]=-VaVbl[X]/B , (9) 

46[X] = (V„lnPx(a) VftlnPxP))^ , (10) 

X„6[X] = (l]aS6)x - (Sa)x(Sb)x • (H) 

P[X] describes the concavity of the log-likelihood func¬ 
tion, /[X] describes the covariance of the log-likelihood 
gradient, whereas x[X] that of the observables. If in a 
general optimization problem these three matrices differ, 
in the the MaxEnt model inference they coincide. In fact: 

46[X]= (ValnPx(a) VfelnPxP))^ (12) 

= -(VaV6lnPx((T))x (13) 

= -WaVbl[X]/B = Uab[X] (14) 

where in the first equality we use a well know proper¬ 
ties of the Eisher matrix and in the second the fact that 
the log-likelihood is linear in P, the only data-dependent 
quantities. Moreover: 

46[X] = -(V„VblnPx(<T))x (15) 

= VaVb\nZ[X\ (16) 

= (EaEfe)^ - (E„)^(I]6)^ ^ Xab[X]. {17) 


where 
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Indeed: 

Xa6[X] = um = Habm . (18) 

The equations (18) are the keystones of this study. 
They affect the geometry of the X-space and thus the 
inference in a peculiar way. In particular, the first equal¬ 
ity allows us to introduce an efficient inference method, 
whereas the second allows us to completely characterize 
its long-time dynamics. 

B. A Bayesian Framework for the Maximum 
Entropy Models 


II. VANILLA GRADIENT, NEWTON METHOD 
AND DATA-DRIVEN ALGORITHMS 

Many of the algorithms suited for solving the MaxEnt 
inference task performs a dynamics in fields space that, 
starting from an initial condition, flows toward the maxi¬ 
mum of the log-likelihood. In many applications, in fact, 
the log-likelihood gradient is fast to compute or estimate 
and can be used to drive the dynamics to the maximum. 
In this section we first review two of these approaches and 
then we propose the Data-Driven algorithm, the focus of 
this work. 


Until now, we posed the MaxEnt inference as a log- 
likelihood maximization problem, without considering 
that the finite size of the dataset could affect the esti¬ 
mate of the observables mean. The error in these esti¬ 
mates will inevitably result in some uncertainty on the 
fields inference that has to be taken into account. In 
fact, even if a carefully choice of the X avoids to include 
noisy observables, the system heterogeneity will results 
in a different precision in the fields estimation. 

A Bayesian framework including prior and posterior 
distributions will exactly account for this uncertainty. In 
fact, through Bayesian inversion, we can compute the 
posterior distribution of the fields, ( X|data ), from 
the prior distribution on the fields and the likelihood 
function, P(data|X) = Ylb • 


pPost( X I data ) 


P( data I X ) pPrior^ X ) 
Norm 


(19) 


where Norm is a X independent normalization constant. 
The width of the posterior distribution around its maxi¬ 
mum quantifies the intrinsic uncertainty on the fields X 
and can be used to test the robustness of the inference. 
Explicitly, any scientific results based on a particular out¬ 
come of the fields inference should remain valid for all 
the fields sets with large posterior probability. Indeed 
the possibility to sample from the posterior is a powerful 
tool to test the robustness of the system analysis. 

Eor the prior distribution we have two possible ap¬ 
proaches: either include a flat distribution that does 
not depend on the fields value, either include a proba¬ 
bility distribution that reflects some a priori knowledge 
on them. The first approach lets the dataset account for 
the whole uncertainly, and in case of very good dataset 
should be preferred. However, in applications dealing 
with strongly undersampled data and/or when some a 
priori knowledge is available, the second possibility has 
shown to be powerful. 

In this work, as an example to show how to include a 
prior term, we focus on the I/2-regularization of the form: 


plr(x 


) = exp I 





where rj is an arbitrary D-dimensional positive definite 
square matrix and |77| is its determinant. 


A. Review: Vanilla Gradient and Newton Method 
algorithms 

Before introducing the method proposed in this work, 
we like to review two well known inference method. 

Ackley, Hinton and Sejnowski [19] posed the infer¬ 
ence problem as a dynamical process ascending the log- 
likelihood function along the gradient direction: 

Xt+i - Xt = (5xy = a(P - Q[Xt]), (21) 

where a is a learning rate. The algorithm works itera¬ 
tively: at each time-step the computation of the model 
averages Q[Xt] allows to perform the fields update and 
to proceed towards the convergence, which is guaranteed 
for sufficiently small a. We call this approach Vanilla 
(Standard) Gradient (VG) algorithm. 

Although it follows the gradient, VG will not go trough 
the shortest path even for arbitrary small a [31]. The 
reason lies in the geometrical structure induced by the 
curvature of the log-likelihood function, namely its Hes¬ 
sian, see eq. (9). To take into account this geomet¬ 
rical effect we can multiply the log-likelihood gradient 
by the inverse of the Hessian [32] obtaining the well 
known Newton-Raphson method. However as suggested 
by Amari[33, 34], in the curved manifold of the log- 
likelihood the natural local metric is the model Eisher 
matrix /[X], see eq. (10). Indeed, in the geometry in¬ 
duced by /[X], the steepest ascendant direction is 
the contravariant form of the gradient (6) [31]. However 
for the MaxEnt models inference, Hessian and Eisher ma¬ 
trix coincides and so do Newton-Raphson and Natural 
gradient methods. Consequently, we do not need to dis¬ 
tinguish and simply replace the VG update (21) with 

= ax'MXt] • (p - Q[Xt]) (22) 

to obtain both the Newton Method (NM) and the Nat¬ 
ural gradient. The positiveness of y ensures the conver¬ 
gence of this method at least for infinitesimally small a 
(see later). Despite it is optimal, the Newton Method is 
slowed down by the time-consuming estimation and in¬ 
version of x[X] at each update step. In the following we 
suggest a way to bypass this problem. 
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B. The Data-Driven Algorithm. 

As x[X] depends only on model averages of observables 
products, we can approximate its value at the solution 
X = X* with the dataset configurations list: 

Xab[X*]«l^= (23) 

This matrix can be computed before running the in¬ 
ference dynamics and then used to evaluate the fields 
update. The resulting Data-Driven (DD) quasi-Newton 
Method update rule reads: 

5XfD=a—-i-(P-Q[Xt]) . (24) 

The quality of the approximation x[X*] ^ ^ depends 
on two main hypotheses: i) the ability of the MaxEnt 
model to reproduce dataset statistical properties beyond 
the mean of X and in particular the experimental X- 
covariance and ii) the good sampling quality of the ex¬ 
perimental dataset. The first hypothesis, however, was 
already partially assumed when it has been chosen the 
observables to reproduce and so the MaxEnt model to fit. 
Otherwise stated, if the approximation is poor, it means 
that the chosen MaxEnt model was not a good choice. 
The second hypothesis, instead, reflects the quality of 
the whole inference task. In case of strong undersampling 
the inference problem, despite being mathematically well 
posed, is meaningless as the information encoded in the 
dataset does not support the fine tuning of the fields. In 
sect. VI we will give a practical condition to test this 
second hypothesis. 

The quality of the approximation (23) controls the 
speed up factors of the DD algorithm: the better the ap¬ 
proximation is, the better and faster the DD will work. 
On the contrary, when the approximation is poor, the 
DD will not be advantageous with respect to the VG. In 
fact, as ^ is positive definite, no matter how bad the 
approximation is, for sufficiently small a the DD will still 
converge toward the right solution X*. 

In conclusion, for cases where the inference problem is 
meaningful and the consequently the approximation (23) 
is valid, the DD algorithm will speed up the inference, 
whereas in the other cases DD will be useless but not 
counterproductive. 

III. THE LEARNING DYNAMICS 

All the approaches explained before, (VG, NM and 
DD) solve the learning task through a discrete-time dy¬ 
namics in the fields space. At each time step the esti¬ 
mation of the model averages of the observables (Q[X]) 
allows to compute the fields update and continue the dy¬ 
namics. Eor large system size V, however, the exponen¬ 
tial complexity of the problem prevents the exact com¬ 
putation of the observables averages and some approx¬ 
imations are required. A standard, but fruitful, choice 
consists in using Markov-Ghain Monte-Garlo (MG) with 


Metropolis algorithm to sample M system configurations 
and use them to estimate: 

(25) 

^ 6=1 

is now a random variable approximating Q[X] up 
to 0{^/l/M) fluctuations. 

Even for large M, the randomness of Qx^ will even¬ 
tually affect the convergence of the dynamics, as for X^ 
sufficiently close to X*, the size of the gradient will be¬ 
come comparable with its fluctuations. However, at the 
beginning, we expect the dynamics to be almost deter¬ 
ministic and then to become stochastic only at the end. 
We indeed separate the dynamics in two regimes: 

1. Approaching the convergence^ when P — Qx^ is 
small but still much larger than its fluctuations. 

2. The long-tim e s tochastic dynamics^ when 

P — Qx^ — and thus comparable with its 

fluctuations. 

In appendices B and G we will provide several details 
of the two regimes and here we only present the mayor 
results. 


A. Approaching the convergence 

After an initial transient where the dynamics strongly 
depends on the chosen algorithm (VG,NM or DD) and 
on the initial conditions, we expect the X^ to approach 
the log-likelihood maximum X*, so that we can approx¬ 
imate the log-likelihood function up to the quadratic or¬ 
der. Given 6l[X.] = /[X] - /[X*], we have 

(5I[X] (X-X*) • x[X1 • (X-X*) 

~_| (X-X*) ■ IT ■ (X-X*) . (26) 

In this approximation the dynamics is exactly solvable 
upon projecting the fields on the x[X*] Eigenvectors 
{vn?=i: SXi^ = j:aVa^SX,,t. 

Along a /i-Eigenspace the convergence of the VG al¬ 
gorithm scales with the corresponding Eigenvalue as: 

r\j (1 — Gonsequently, to ensure the algo¬ 

rithm convergence, we need a < 2/A^ along all direc¬ 
tions. Moreover, by optimizing the convergence speed 
along all directions simultaneously we obtain <^^best ~ 
2/(A> +A<), where A>/< are the largest/smallest Eigen¬ 
value. In App. B we present some details and here we 
simply notice how (^^best squeezed to very small 

values by a large A> preventing the learning along all 
the direction with A^ ^ A>. Gonsequently, for dataset 
where the Eigenvalues of x[X*], or of its approximation 
AG? spread over several order of magnitude the conver¬ 
gence of the VG algorithm will be very slow. 
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Because of the quadratic approximation, the NM and directions with large (small) are larger (smaller) than 
the DD algorithms coincide and we do not distinguish the DD ones. These differences will have consequences on 
between them. Within their dynamics, SX^ ^ {1 — aY the ability of both algorithms to reproduce the dataset 
independently of the Eigenvalues A^. consequently a < 2 statistics and thus to avoid both under- and over-fitting. 

is the only convergence condition and ot^BEST ~ 


B. The long-time stochastic dynamics 


Before proceeding we like to introduce the shortcut 
notation: 


A/’[m; s](t) = 


p-hT^abita-ma)sJitt-mb) 


a normal distribution with average m and covariance s 
evaluated at t. 

When X ^ X*, we expect = B{P - Q^^) ^ 

0 on ly on average, with not negligible fluctuations of 
0{^/l/M) . In this regime the fields dynamics is not 
anymore deterministic and the convergence becomes a 
stochastic process. Consequently, rather than converge 
to a fixed point, the fields will approach an equilibrium 
stationary regime around X*. 

The stochastic process is ruled by the discrete-time 
master equation: 


C. The dynamics under an external stochastic force 

For forthcoming purposes, here we characterize the 
learning dynamics when a linear stochastic force term 
is added to the gradient term in the learning rules. This 
analysis will be useful when a L2 prior term is included 
in the inference procedure. We modify eq. (24) as: 

= a + F^ ) . (33) 

where 

= ~ +»? (34) 

P(F^)=A/-[-,?.X;|:](F^). (35) 

The calculation for the stationary fields distribution fol¬ 
lows as in the previous section and it results in: 

pDD’’(X) = A/-[x;; _^^^^-i](X), (36) 


Pt+i(X') = JDX Pt(X) W^x^x', (28) 

where the transition rates depend on the distribution of 

In the large M limit, the gradient distribution can be 
approximated as a Gaussian to obtain an analytic ex¬ 
pression of see app. C. By asking Pt{X) to 

be invariant under the evolution (28) we can obtain the 
stationary distribution Poo{X.): 

P^«(X) =A/-[x*;^(2(5d -a^)“'](X), (29) 

pS^(X) 

where Sb is the identity matrix in dimension D. Here the 
typical fluctuations of X around X* must consistently 
verify the approximation X ^ X*: ((X —X*)^) 

should be small enough to allow the expansion (26). 

In the stationary regime, on top of the fluctuations 
induced by the MC, the X distribution will induce a sec¬ 
ond source of noise in the actual distribution of 
On average we expect (see app. C): 

P™{Q“<=) = A^[P; ,qMO),3i) 

pDD(QMO) . (32) 

Interestingly, the two algorithms provide different 
distributions. In particular, the VG fluctuations along 


where 

x; = ^-F^.x*. (37) 

Analogously to (32), from (36) it follows: 

pDD,( qMC [p; JYX) ] ■ 

(38) 

Expressions for the VG algorithm can be easily obtained 
from (36) and (38) with substitution a aXr]~^ • 

IV. AVOIDING UNDER- AND OVER-FITTING 
BY SAMPLING FROM THE POSTERIOR 
DISTRIBUTION. 

In the previous section we shown how any actual imple¬ 
mentation of the algorithm is stochastic. In particular, 
depending on the chosen algorithm and its parameters, 
a and M, we expect a whole probability distribution on 
the output fields X. This stochasticity raises questions 
on which implementation should be preferred in practical 
applications. The simple strategy of trying to obtain the 
best possible approximation of X* by increasing M or 
reducing a will unambiguously lead to data over-fitting 
thus limiting the model prediction power. The presence 
of noise in the estimation of P induces some uncertainty 
on the model fields X that should be taken into account 
to avoid over-fitting. 

This fields uncertainty is quantified by the posterior 
distribution of the fields given the data, see eq. (19). 
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Within the approximation (26), the posterior simplifies 
to 


pPost( X I data ) oc g-f -(X-x*) 

in the case of flat prior and to 

X I data ) oc e-f [ (x-x*).^.(x-x*)+x.^.x ] 

(40) 

when an L2 prior, see eq (20), is considered (X^ and W ?7 
are those defined in eqs. (37) and (34)). 

If by tuning the algorithm implementation and settings 
we can match the a stationary fields distribution and the 
posterior, we will be able to sample from it, thus avoiding 
any over- and/or under-fitting. In case of flat prior, we 
have to compare (39) with (29) and (30), whereas in case 
of 1/2 prior (40) with (36) for the DD algorithm and the 
analogous for the VG one: by setting 


2M 

B^M 


(41) 


the DD algorithm will sample from the correct posterior 
distribution and over- and under-fitting will be avoided 
along all the D dimensions. On the contrary, for the VG 
algorithm such a setting does not exist and the algorithm 
will not sample from the posterior. In particular direc¬ 
tions with <C 2M/ {B M) are strongly over-fitted 
and those with a\^ ^ 2M/{B + M) are strongly under¬ 
fitted. Otherwise stated: by choosing the VG algorithm 
with a small enough to well reproduce directions with 
large will result in overfitting of all directions with 
small A^. 

Among the consequences of overfitting, for finite B 
we expect an overestimation of the observed model log- 
likelihood. In fact, as the exact inference fields X* repro¬ 
duce also the noise in the experimental averages P, we 
expect /* = /[X*] > /, where I is the log-likelihood of the 
model that generates the data, the true log-likelihood. In 
order to quantify this effect, in appendix D we compute 
the average log-likelihood estimation in the case of the ex¬ 
act inference when the data are synthetically generated 
by a MaxEnt model with true fields X. By averaging 
over the distribution of P, we find that 


in -1 = 


D 

W ’ 


(42) 


which shows how the log-likelihood maximization, see eq. 
(3), induces a finite bias leading to a log-likelihood over¬ 
estimation. In the case of the DD algorithm, instead, the 
average over the posterior distribution exactly cancels the 
bias, and the true log-likelihood value is recovered. 


V. ALGORITHM IMPLEMENTATION 

The core of DD algorithm is to iteratively update the 
fields Xt with the rule (24), where Q[Xt] is approximated 


by Qx^ through a MG sampling of M configurations. 
Ideally a = Q^best = ^ M = B is the fastest setting 
that satisfies the condition (41). However any practical 
implementation will face two main difficulties. The first 
lies in the fact that the quadratic approximation (26) 
may not be valid. This will happens at the beginning of 
the dynamics, when X^ if far from X*, but also when B 
is not large enough to have (Xt — X*)^ small enough to 
discard third order terms. If it is the case, the distribu¬ 
tion (30) will be non-stationary. For this reason we allow 
the algorithm to adapt the value of a at each time-step. 
The second difficulty reflects the fact that the algorithm 
is not suppose to converge to X* but rather to a probabil¬ 
ity distribution. Gonsequently we need a condition that 
signals the onset of the thermalization and allows us to 
start storing X^s as samples from the posterior distribu¬ 
tion. For this reason we introduce the following quantity: 



(43) 


Under the distribution (32) with a = 1 and M = H, eoo 
will have distribution: 


-P{eoo) 


D\ o# 


' exp{-^eL} 


r(f) 


where r(x) is the Gamma function, eoo has: 

Vd r(f) 

/ {slo) - = /l - 


1 


(44) 


(45) 

(46) 


In the large D limit, the eoo-distribution shrinks to a 
Dirac-delta function at eoo = 1- if before thermalization 
we expect with high probability ^ 1, once the algo¬ 
rithm starts sampling from the stationary distribution 
(30) we expect ~ 1. Gonsequently, once the condition 
St < I is full filled, the subsequent X^ are good estima¬ 
tions of the fields. In order to effectively sample from 
(30) it will be still necessary to keep running the algo¬ 
rithm in order to decorrelate from the initial condition 
Xo. 

The DD algorithm that we implemented can be 
sketched as follows: 


1. Ghose an initial configuration for Xq com¬ 
pute/evaluate Q[Xo], see eq. (8) and compute 
So from eq. (43). Then set Ofo = 1 and Mq = 
min(4,B). 

2. Iterate the following step: 

(a) update the Xt through eq. (24), 

(b) estimate Q[Xt] though Mf = min(^-,H) 
Gibbs samplings, see eq. (25), 

(c) compute St from eq. (43) and 
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i. if St < St-i accept the update and set 

at = at-i6^ 

ii. otherwise discard the update, set at = 
at-i/6~ and estimate again Q[Xt]. 

3. As soon as the condition < 1 is full filled, 

(a) either fix a and let the distribution Pt(X) 
decorrelate from Xq and thermalize to 
Poo(X), see (30). 

(b) either stop the algorithm and retain X^ as a 
fields list solving the inference problem. 

A variable and adapting a is required because the system 
can be be far outside the validity range of (26). In order 
to avoid cycles, we heuristically set = 1.05 and S~ = 

V2. 

The choice of an adapting Mt = min(^,P) < P, 
instead, allows to save time during the deterministic 
dynamics regime, see sect. Ill A, when the algorithm 
does not require a high precision in the gradient esti¬ 
mate. in fact measures also the norm of the log- 
likelihood gradient in the metric defined by its fluctu¬ 
ations, P2j[x]/M = P2^[X]/M ^ B‘^-x/M, see eq. 
(C4): 

= ■ (47) 

For Mt = Bje^ we expect the estimate of the gradient to 
be statistically confident within one standard deviation 
and consequently enough well estimated. Moreover the 
randomness induced by small Mt values decorrelates the 
dynamics from the initial condition. When the algorithm 
approaches convergence, St ^ 1 and consequently Mt 
B thus full filling condition (41). 

The complexity of the algorithm depends linearly on 
NB through the MC sampling. From the tests we per¬ 
formed the number of required steps depends mostly on 
the dataset properties and in particular on the exactness 
of x[X*] ^ The hardest limit of the DD lies in the 
memory allocation for storing Working in double 

precision, for an hardware with 32Gb of RAM the maxi¬ 
mum number of manageable units is A" ^ 350 [35]. 
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FIG. 1. (Colors online) Eigenvalues distribution of the em¬ 
pirical susceptibility matrix, see eq. (23), for a pairwise Ising 
model inference problem on biological data. As often happens 
in biological problems [36], the Eigenvalues span several order 
of magnitude enhancing the heterogeneity of the parameter 
space. As will be show in the inset of Eig. 3, the suscep¬ 
tibility matrix of the inferred Ising model has a very similar 
spectrum. Consequently, even with the optimal learning rate, 
a^BEST ^4- (^3)’ very slow in learning along 

the directions with the smallest Eigenvalues. As the sensi¬ 
tivity (1/R) lies well below the smallest Eigenvalue all the 
susceptibility spectrum is informative and has to be repro¬ 
duced by the inference. See text for more details. Data from 
a ~2.1/i (binned at 16ms) recording of 95 rat retinal ganglion 
cells subject to visual stimulation [12]. 


which, remarkably, does not require the explicit inver¬ 
sion of the matrix Moreover, as ^ is, by construc¬ 
tion, non-negative, ~x^ is a positive matrix and can be 
inverted. As a consequence, the DD algorithm regular¬ 
ized with L2 prior can be applied when an undersam¬ 
pling induces zero modes in the empirical estimation of 
the susceptibility matrix. 


VI. TEST 


A. The L2 prior term 

In order to include a L2 prior term we add to the dy¬ 
namics the stochastic force term introduced in sect. Ill C. 
At each iteration, we should modify the update rule by 
adding to the gradient term independent real¬ 

ization of the random variable according to the dis¬ 
tribution (35). As the prior term will biases the inference 
we need to modify the function (43) in order to account 
for the difference between (32) and (38): 



( 48 ) 


As explained before, the DD algorithm performance 
depends mostly on how good is the chosen MaxEnt model 
in modeling the dataset statistical properties. As ex¬ 
pected, we succeed in inferring back the fields X from 
synthetic dataset obtained through simulation of Max¬ 
Ent models. Indeed we find more interesting to present 
an application to biological data where the underlying 
statistical distribution does not belong to class of models 
considered in the Inference task. 

We tested the DD algorithm on an ex-vivo multi¬ 
electrode array recording [37] of 95 rat retinal ganglion 
cells [12]. The retina was stimulated though a video 
showing two randomly moving bars [12] displayed with 
a frame rate of 6077^. The ^ 2.Ih spike trains record¬ 
ing was binned at 16ms to obtain B ^ 4.8 • 10^ system 












FIG. 2. (Colors online) Behavior of the pairwise Ising model 
inference of biological data through the Data-Driven algo¬ 
rithm. St (lower blue curve, left axis) and of at (upper red 
curve,right axis) during the algorithm running plotted against 
the iterative step number. Abscissa is not proportional to the 
running time as each step performs a different number M 
of MC sampling, see text, at is raised (lowered) at every 
step where St decreases (increases). The algorithm requires 
100 ± 5s to produce a set of fields X satisfying the early-stop 
condition St < 1- Dataset as in fig. 1. 



Data 

FIG. 3. (Colors online) Scatter-plot of the experimental con¬ 
nected correlations, aj = {crtaj) data - (cr^) data( cr^) data, 
against those estimated with MC sampling of M = B = 
4.8 10^ equilibrium configurations of the inferred model dis¬ 
tribution. Inset: scatter-plot of the ordered Eigenvalues of 
DC (x-axis) against those of x[X*]. Dataset as in fig. 1. 


configurations, where we assign (7^(6) = 1 if cell i emit¬ 
ted at least one action potential in the time-bin b and 
cFi{b) = 0 otherwise. We infer a pairwise Ising model thus 
restricting the observables list to single and pairwise cor- 
relations ({Sa}f=i = and con- 

sistently the fields to biases and pairwise interactions 
{{Xa}a=i = {{hiW=i,{JijW<j=i})- With this observ- 
ables choice the model takes the form of the well known 
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FIG. 4. (Colors online) Scatterplots of the inferred couplings 
(main panel) and biases (inset) with the DD algorithm against 
those of Naive (blue triangles) and Resummed (red circles) 
Mean-Field. The black lines show equality. All the approx¬ 
imations fails in the reconstruction with a tendency to over¬ 
estimate both couplings and fields (notice the different axis 
scale). Dataset as in fig. 1. 
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FIG. 5. (Colors online) Scatterplot of the inferred couplings 
(main panel) and biases (inset) for a pairwise Ising model with 
and without L2 prior term, see eq. (20), with 77 = 5 - Su- 
Blue triangles compare the inferred fields with (x-axis) and 
without (y-axis) prior term. Red circles compare the inferred 
fields with regularization (x-axis) with the expected ones (y- 
axis): X^^ = ‘ AC ' X^^, see eq. (37). The black line 

shows equality. As a consequence of the L2 regularization the 
inferred fields are smaller in absolute value. Dataset as in fig. 
1 . 


Disordered Ising Model: 

-Ph.j(cr) = exp I ^ hiai + ^ Jijaiaj | / Z[h, J] (49) 

i i<j 

where Z[h, J] is the normalization constant. 

Before performing the inference it is important to check 
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whether the the number of measurements is large enough 
to empirically estimate the observables averages with 
good precision. In the large B limit, we expect an empir¬ 
ical error in the estimate of P to have zero mean and 
covariance X jB^ see eq. (D2). By linear regression, en 
error ^P in the data will induce en error: 


= 


/<^Q[x] 


V (5X 


5F 


X 


• (5P (50) 


ix=x* 

on the inferred fields. Consequently, by approximating 
X 

P(,5X)=A/-[0; (B^)-q((5X) . (51) 

To ask for this error to be small, we should indeed require: 

1 


1 , for all yU 


B 


< A; 


MIN 


(52) 


As shown in Fig. 1, in the considered dataset, condi¬ 
tion (52) is satisfied: the experimental sensitivity, 1/5 is 
much smaller then the smaller Eigenvalue of AC- 

On an Intel® Core^^ 17-4770 with eight cores at 
3.4GHz the written in Matlab® DD algorithm takes 
100 ± 55 to reach convergence. In Fig. 2 we show the 
behavior of both and at against the number of infer¬ 
ence steps. As the number M of MC samplings is not 
constant during the inference, steps are of different time 
duration. 

In order to show the quality of the inferred fields, in 
the main panel of Fig. 3 we scatter-plot the experimental 
connected correlations against those estimated through 
B sampling of the inferred Ising model distribution. In 
order to give an insight on the validity of (23), in the 
inset we scatter-plot the ordered Eigenvalues of AC (^- 
axis) against those of x[X*]. 

To convince the reader that the tested inference prob¬ 
lem was difficult, we apply two example Mean-Field ap¬ 
proaches to the dataset. In Fig. 4 we show the scatter- 
plot of the inferred fields against those of Naive [25] and 
Resummed [29] mean-field approximations. Despite Re- 
summed works much better than Naive, for this dataset 
both approximations largely overestimate couplings and 
biases. 

For comparison, we tested the VG algorithm on the 
same dataset with M = B and various a < c^^best' With 
a = the VG takes ^ 4.2 • 10^5 to converge, 

whereas for larger a values it was not able to satisfies 
the early-stopping condition e < 1. 

In Fig. 5 we scatter plot in blue the inferred fields 
against those inferred with a L2 regularization term, see 
eq. (20), with = 5 • 10“^ whereas in red we scatter 
the inferred with prior fields against the expected ones, 
see eq. (37). 

A part from the retina example, we succeeded in infer¬ 
ring the Ising model fields from synthetic data, human 
temporal cortex recording [38] , rat pre-frontal cortex 
recording [4] and others. In all the tested cases conver¬ 
gence times are of the order of tens to few hundreds of 
seconds. 


VII. CONCLUSIONS AND DISCUSSION 

In this study we introduced a Markov-Chain Monte- 
Carlo based algorithm for solving the MaxEnt inference 
problem and sampling from the posterior probability dis¬ 
tribution of the MaxEnt fields X, namely couplings and 
biases of an Ising model or their generalization in the 
presence of non-pairwise interactions. We carefully an¬ 
alyze the learning dynamics and separate two different 
regimes: i) a deterministic dynamics that approaches the 
solution and ii) a stochastic dynamics of the probability 
distribution of the fields, Pt(X), that thermalizes to a sta¬ 
tionary distribution. By tuning the algorithm settings, 
namely a and M, this distribution can reproduce the 
posterior distribution of the inference fields thus allow¬ 
ing to sample. We concluded presenting an implementa¬ 
tion of the algorithm and a test on a biological dataset 
showing how the presented algorithm outperforms the 
standard learning approach. The core of the algorithm 
is the approximation (23) which requires at first to have 
enough data to properly estimate the empirical suscepti¬ 
bility then that the probability model chosen for the in¬ 
ference reproduces quite well the data statistics. As the 
applications to biological data have shown, both condi¬ 
tions have not to be intended as rigid constraints, but 
rather as requirements for obtaining the largest advan¬ 
tage of the DD approach. 

The key properties of the MaxEnt inference that driven 
this study are the relationships (18). The equality among 
the log-likelihood Hessian and the model Fisher allows to 
tune the MG fluctuations of the log-likelihood gradient 
(ruled by the Fisher) in order to reproduce the fluctua¬ 
tions of Posterior distribution around the log-likelihood 
maximum (ruled by the Hessian). The equality with the 
susceptibility matrix allows to obtain an expression of 
both the Fisher and Hessian that depends only on X- 
dependent averages of X-independent functions that at 
the solution X* can be approximated through the data. 

In sect. HI, we have carefully analyzed the learning dy¬ 
namics and in HI B we have characterized it as a stochas¬ 
tic process that converges to a stationary distribution. 
Later, in sect. IV, we have shown how the DD algorithm 
stationary distribution can be tuned to the posterior dis¬ 
tribution. This characterization has several advantages: 

• the computation of P^^(Q^^), together with the 
running evaluation of e^, see eq. (43) provides an 
useful early-stopping condition for the algorithm 
that allows to check when the inference has been ac¬ 
complished. Gonsequently it is useful to save much 
computational time. 

• any inference algorithm has to account for the ran¬ 
domness of the MGMG fluctuation and typically 
large computational efforts are required to get rid 
of this noise source (large number of MGMG sam¬ 
plings M) . The DD algorithm, instead, takes ad¬ 
vantage of these fluctuations to avoid overfitting 
and to decorrelate from the initial condition. 
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• as we have shown in appendix D, in comparison 
with the exact inference, the posterior sampling al¬ 
lows to avoid overfitting the model. 

• as in the case of the VG algorithm, an overfitting 
along several directions happens with all the algo¬ 
rithms that do not rectify the fields space. 

• as explained in sect. IB, the posterior sampling 
can provide several consistent lists of inferred fields, 
thus allowing to test the robustness against the 
dataset noise of any analysis based on the inferred 
fields. 

Our approach and the use of the MC randomness to 
induce noise in the algorithm outcome may remind the 
application of stochastic gradient method for sampling 
from the fields posterior distribution [39]. Despite the 
results may look similar, in order the induce fluctuations, 
here we take advantage from the intrinsic algorithm ran¬ 
domness instead of introducing it by sub-sampling the 
dataset. The mayor advantage results from the possibil¬ 
ity to select and optimize properly the algorithm setting 
in order to get rid of the MC induced noise that oth¬ 
erwise will affect the dynamics in a spurious way. One 
important drawback arises from the fact that the algo¬ 
rithm fluctuations may not reproduce the experimental 
ones when the Gaussian approximation (26) is not valid 
and second order terms are not enough to approximate 
the log-likelihood function around the solution. In this 
case, however, one can notice how the equalities (18) can 
be extended up to higher order cumulants. These rela¬ 
tionships, together with an appropriate update rule that 
takes into account higher order corrections may extend 
the equivalence between Poo(X) and the posterior. We 
let this generalization for forthcoming investigations. 

The DD approach takes its place in the list of algo¬ 
rithms for exactly solving the MaxEnt model inference 
problem. Its greatest strengths are the velocity and 
the possibility to sample from the posterior, whereas its 
stronger limitation is the memory requirement for the 
storage of As an example, the inference of a = 350 
pairwise Ising model requires 32Gb of RAM. 

Depending on the dataset properties the DD should 
or should not be preferred to others algorithms as [21], 
[22] or [24]. In general, if the condition (52) is largely 
satisfied, we expect DD to be the best choice, because 
the approximation (23) is expected to be valid. In the 
opposite case, a largely unsatisfied (52) suggests that the 
inference of the chosen MaxEnt model is not meaning¬ 
ful and the observables list has to be modified. However 
for cases in between some tests with different algorithms 
have to be performed. In particular cases, some available 
a priori knowledge of the system could help in the algo¬ 
rithm choice. Eor example, if the underlying interaction 
graph is naturally clusterized in almost non-interacting 
subcomponents. Selective Gluster Expansion (SGE) [22] 
is probably the best choice. SGE, in fact, splits the sys¬ 
tem in many subunits that are recursively joined together 


to form larger and larger building blocks of the recon¬ 
structed interaction network. If the interaction graph 
is clusterizable, SGE will recognizes these units and ac¬ 
complish the inference task quickly. However, up to our 
knowledge, a simple and generic argument for choosing 
the best suited algorithm for the actual inference problem 
is still missing and it would be of large interest. 
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Appendix A: The existence and uniqueness of the 
inference solution 

Because the matrix H[X], see eq. (9), measures the 
concavity of maximization problem (3), its positiveness 
guarantees the existence and uniqueness of the inference 
solution X*. As H[X] can be expressed as the covariance 
matrix x[X], see eq. (18), it is by construction non¬ 
negative, but it could have zero modes. A zero mode 
in a covariance matrix identifies a linear combination of 
the observables, the corresponding Eigenvector, that does 
not fluctuate within the model probability distribution. 
This could happens either for all the values of the param¬ 
eter X when some of the functions X(cr) are linearly de¬ 
pendent either at the solution X* when some of the fields 
diverge quenching (part of) the system[19, 23]. This last 
case usually happens when the dataset suffers of unser- 
sampling. 

As an example, consider a dataset of two non constant 
spins cri{b) and ( 72 ( 6 ) that within the dataset are never 
active together, so that cri{b)a 2 {b) = 0 for b = 1 ,..., H. 
If we take only Ei(cr) = ai and E 2 (cr) = <72 as observ¬ 
ables, the inference problem is well posed. However if we 
include, for example, T^ 3 {cr) = ai + (72 or T^ 4 ,{cr) = ( 7 i (72 
the matrix x[X*] will develop zero modes. In the first 
case, because Ti 3 {cr) = Ei(( 7 ) + ^ 2 ( 0 ') in the latter case 
because X 4 —00 in order to fix (E4((7))x* = 0. 


Appendix B: On the deterministic convergence of 
the Vanilla Gradient algorithm. 

As introduced in sect. HI A, for X X* the deter¬ 
ministic dynamics of the VG algorithm is exactly solv¬ 
able upon projecting the fields on the x[^*] Eigenvec¬ 
tors. Along a /i-Eigenspace the convergence of the VG 
algorithm is not uniform and scales with the correspond¬ 
ing Eigenvalue as (1 — aX^y. Indeed by tuning a to 
speed up some particular direction, the others can suffers 
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of very low convergence. The learning rate that optimize 
the convergence speed along all direction simultaneously 
is 


^BEST 


max 


= arg mm 

a 

= arg min max 


T — aX,j 


1 — (aA<^ 


(Bl) 

1 — (aA> I ] 


where A>/< are the largest/smallest Eigenvalue. This 
equation can be solved by equating the two expression in 
the max: 


1 _ A - 

^BEST^< ~ 


1 — a 


(VG) . 
BEST^> 


and the solution reads: 


^BEST ~ 


2 

A> + A< 


(B2) 

(B3) 


In particular, ol^best squeezed to small value by 

large A> preventing the learning along the direction with 
^ As an example, for the biological data we will 
consider, see Fig. 1, A>/A< ^ 10^. Moreover the ratio 
A>/A< has been shown to diverge in synthetic data of 
model at criticality [36]. 


Appendix C: The stationary distribution of the 
stochastic dynamics 


As introduced in sect. Ill B, the stochastic dynamics of 
the fields X is ruled by the discrete-time master equation: 

Pt+i(x') = J DX Pt(X) 

where the transition rates depend on the distribution of 

As are the average of the observables X, for large 
M we expect them to be almost Gaussian distributed 
with a covariance equal to the X covariance, namely 
x[X], divided by the number of MC measurements: 






w 


Q[X] 



(Cl) 


Consequently, as V/x^ = B(P — Qx^) we have: 




V/[X] 


x[X] 

M 


{VI 


MC\ 

X ) 


If the truncation (26) is valid and x[X ^ X*] « X , we 
can replace the gradient mean by the derivative of the 
approximated log-likelihood. 


The transition rates ITx^x' are the probability to 
measure a value of such that the next fields value 
in the dynamics is X'. From eqs. (21) and (24) it fol¬ 
lows: 




.=^-[-(X.-X);i](2^) (C4) 


H'xSx. 




—(X' - X) 


a 


)(C5) 


By asking Pt(X) to be invariant under the evolution 
(28) we can obtain the stationary distribution Pqo{'K): 


P^^(X)=V 

PDD(X)=V|X*; 


X*-,—{2Sj,-a-x 


a 


M{2 - a) 


X 


(X), 

J 

-1|(X), 


where 5d the identity matrix in dimension D. Here the 
typical fluctuations of X around X* must consistently 
verify the approximation X ^ X*: ((X —X*)^) 

' ' ^oo 

should be small enough to allow the expansion (26). 

We can obtain conditions on the algorithm convergence 
by requiring Poo{X) to be a properly defined probability 
distribution, namely to be integrable. Remarkably by 
asking Poo{X) to have a positive covariance, we re-obtain 
the upper-bounds for the learning rate a\ aX^ < 2, for 
all /i, for the VG and g < 2 for the DD. 

In the stationary regime, the fluctuating fields X will 
induce a second source of noise in the actual distribution 
of By inserting the approximation (C2) in the 

distribution (Cl) and then by averaging over Poo(X), we 
obtain: 


P^^(Q^^^)=Ar P; 


^ 2 X {2Sd -axY 


M 




pUU(QMO)^^ P; 


Ti. 2 X 


M{2-a)i 


(Q“^) . 


Poo(X), see eq.s (29) and (30), allows us to compute 
the expected deviation of the log-likelihood from /[X*]. 
By averaging ^/[X], see eq. (26), we obtain: 

/w\ _ ^ ^ “ 

Wps^--jMi2-a) 

/= \[jM{2-a) 

for the DD algorithm and 


{SI) 


pVG — ~ 

oo 


aB 


E 


2 - a\^ 


v;[x] = v[-|(x-x*)Ar(x-x*)] (C2) 


= B —(X* - X) , 
to finally obtain: 


(C3) 


fa v 




B Ar(X* - X); — 


MC\ 


for the VG (we do not report the slightly involved ex¬ 
pression of the variance, which also scales as I/M). Both 
estimations are indeed biased to lower values with large 
fluctuations (of the order of the bias itself). For the DD 
algorithm the bias depends just on P, g and M and con¬ 
sequently it is data independent. For the VG, instead, 
it depends strongly on the spectrum of and for large 
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Amax or a 2 /Amax it can reach very large values, thus 
nullifying the inference effort. 

Note that the three matrix appearing in the covariance 
of eq. (Cl) and in the mean and covariance of eq. (C4) 
are a priori different: the first is the model susceptibility, 
the second is the log-likelihood Hessian where the third 
is the model Fisher matrix. However, as explained in 
section IA for the MaxEnt inference problem these three 
matrices coincide providing the results (29) and (30). 


Appendix D: The posterior sampling avoids to 
over-estimate the log-likelihood 

To better understand the consequences of over-fitting 
we consider now the case where the system that gener¬ 
ates the data is of MaxEnt form with some unknown true 
fields X. By ideally sampling the distribution infinitely 
many times we can access to the true means of the con¬ 
jugated observable P and from these compute the true 
log-likelihood: 


r=P-X-lnZ[X] . (Dl) 

We like to compare the true log-likelihood with that ob¬ 
tained by the exact inference, the one leading to X* and 
with that obtained by the DD algorithm which samples 
from the posterior. By sampling B times from the true 
model distribution we can generate synthetic dataset and 
obtain empirical estimates P of P. Through the central 
limit theorem we can approximate the distribution of the 
expected deviation of the observable means: 


P( (5P ) =P( P-P ) = AT 




(P) (D2) 


where y = x[X] is the susceptibility matrix of true model. 

The exact inference of the fields perfectly reproducing 
P overestimates the log-likelihood, in fact: 


— max 

^ X 


= max 
(5X 


(P + (5P) •X-lnZ[X]] 

(P + (5P) ■ (X + (5X) - In Z[± + 5X.] 


I + SP • X + max 
(5X L 


6P^6X- -6X • X • (5X 
2 


= T^ 5P-5P-SP 


(D3) 


where we approximate In Z[X + (5X] up to the second or¬ 
der. Through the expression (D2) we can approximate 
the distribution of /|p over many realization of the syn¬ 
thetic experiment: 


P( /* ) = AT 


r, D x-x-x 

^ 2P ’ B 


in 


(D4) 


where as before, D is the dimension of the fields vector 
and we discard terms of order B~^. /* is on average 

positively biased by a factor 



(reflected) obtained from the inference of S' = 2 • 10^ empiri¬ 
cal estimation of ^P. See text for the details of the synthetic 
pairwise Ising model used for the simulations. Arrows in¬ 
dicate the corresponding histograms empirical means. The 
black vertical bar indicates the numerically exact value of /, 
the exact model log-likelihood, whereas the black horizontal 
segment measures D/{2B). Dotted lines represents the the¬ 
oretical distributions, see eqs. (D4) and (D7). As predicted, 
see text, the average of lY^ coincides with /, whereas the av¬ 
erage of suffers by a bias of D/{2B) 


In the case of the DD algorithm {a = 1 and M = B) 
we have to substitute the maximization over X with an 
integration over the stationary fields distribution (30), 
which in this case will read: 


pDD( SX\6P) = pDD( X - X I ,5P ) 


= Af 


X 


-1 


SP ; 


X ^ ^ 


B 


m (D5) 


where the mean equals value of (iX after the maximiza¬ 
tion in the calculation of /|p, see eq. (D3). Eor the DD 
algorithm we obtain: 


^l + 6P ■±+ (^6P ■ 5X- -6X -x-SX^ 

= T+sp-x + bp • • < 5 P - T 

2 2B 


-r ^ 


(D6) 


and again through (D2) we obtain: 


p{i 


DD 


= Ar 


I ; 


x-x-x 

B 


{rn , (D7) 


where again we discard terms of order B~^. Coherently, 
the integration over the posterior distribution, that pre¬ 
vents to exactly maximize the log-likelihood, cancels the 
bias in the average of (D7). 

To test these results we perform an analysis on a 
pairwise Ising model of = 10 units, thus restrict¬ 
ing the observables list to single and pairwise corre- 
lations ({I]a}f=i = {WiW=i, and con- 

sistently the fields to biases and pairwise interactions 
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{{Xa}a=i = As synthetic model 

we chose a diluted disorder Ising model on a Erdos-Renyi 
random graph with average connectivity c = 4: 

Jij = 1.33 , with prob. 2 (n-i) 

Jij = -1.33 , with prob.2(^^ ^ 

Jij =0, , otherwise 

hi = -0.46 ^^j^iJij 

We first computed I = 5.7092 and then we simulated 
the model S = 2 ' 10^ times to collect S empirical esti¬ 
mates of ^P. The number of MCMC sampled was fixed 
at B = 2^^. For each of these S realization we estimate 
/|p and ifp and in Fig. 6 we compare their histograms. 
Dotted lines correspond to the theoretical distributions 


(D4) and (D7), whereas the arrows indicate the empiri¬ 
cal means. The black vertical lines represents 1. As can 
be appreciated by the small difference between the black 
vertical bar and the blue arrow, the average over the pos¬ 
terior distribution distribution removes the bias produced 
by the exact inference. Moreover the difference between 
the red and blue arrows is approximately D/{2B), equals 
to the length of the horizontal segment. More precisely: 

T- (igp) = (-35 ± 3.0) • lO""^ 

\ =(-1.2 ±3.0)-10-^ 

2B \° /p(SP) 

T-(lf^) = (-0.3±0.6) • 10“'‘ 

\ /P((5P) 
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